Automatically Constructing a Normalisation Dictionary for Microblogs
نویسندگان
چکیده
Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highlyranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing. 1 Lexical Normalisation A staggering number of short text “microblog” messages are produced every day through social media such as Twitter (Twitter, 2011). The immense volume of real-time, user-generated microblogs that flows through sites has been shown to have utility in applications such as disaster detection (Sakaki et al., 2010), sentiment analysis (Jiang et al., 2011; González-Ibáñez et al., 2011), and event discovery (Weng and Lee, 2011; Benson et al., 2011). However, due to the spontaneous nature of the posts, microblogs are notoriously noisy, containing many non-standard forms — e.g., tmrw “tomorrow” and 2day “today” — which degrade the performance of natural language processing (NLP) tools (Ritter et al., 2010; Han and Baldwin, 2011). To reduce this effect, attempts have been made to adapt NLP tools to microblog data (Gimpel et al., 2011; Foster et al., 2011; Liu et al., 2011b; Ritter et al., 2011). An alternative approach is to pre-normalise non-standard lexical variants to their standard orthography (Liu et al., 2011a; Han and Baldwin, 2011; Xue et al., 2011; Gouws et al., 2011). For example, se u 2morw!!! would be normalised to see you tomorrow! The normalisation approach is especially attractive as a preprocessing step for applications which rely on keyword match or word frequency statistics. For example, earthqu, eathquake, and earthquakeee — all attested in a Twitter corpus — have the standard form earthquake; by normalising these types to their standard form, better coverage can be achieved for keyword-based methods, and better word frequency estimates can be obtained. In this paper, we focus on the task of lexical normalisation of English Twitter messages, in which out-of-vocabulary (OOV) tokens are normalised to their in-vocabulary (IV) standard form, i.e., a standard form that is in a dictionary. Following other recent work on lexical normalisation (Liu et al., 2011a; Han and Baldwin, 2011; Gouws et al., 2011; Liu et al., 2012), we specifically focus on one-to-one normalisation in which one OOV token is normalised to one IV word. Naturally, not all OOV words in microblogs are lexical variants of IV words: named entities, e.g., are prevalent in microblogs, but not all named entities are included in our dictionary. One challenge for lexical normalisation is therefore to dis-
منابع مشابه
Mining Lexical Variants from Microblogs: An Unsupervised Multilingual Approach
User-generated content has become a recurrent resource for NLP tools and applications, hence many efforts have been made lately in order to handle the noise present in short social media texts. The use of normalisation techniques has been proven useful for identifying and replacing lexical variants on some of the most informal genres such as microblogs. But annotated data is needed in order to ...
متن کاملImproving Microblog Retrieval from Exterior Corpus by Automatically Constructing a Microblogging Corpus
A large-scale training corpus consisting of microblogs belonging to a desired category is important for highaccuracy microblog retrieval. Obtaining such a large-scale microblgging corpus manually is very time and laborconsuming. Therefore, some models for the automatic retrieval of microblogs from an exterior corpus have been proposed. However, these approaches may fail in considering microblog...
متن کاملImproving Microblog Retrieval from Exterior Corpus by Automatically Constructing Microblogging Corpus
A large-scale training corpus consisting of microblogs belonging to a desired category is important for highaccuracy microblog retrieval. Obtaining such a large-scale microblgging corpus manually is very time and laborconsuming. Therefore, some models for the automatic retrieval of microblogs from an exterior corpus have been proposed. However, these approaches may fail in considering microblog...
متن کاملA Supervised Method for Constructing Sentiment Lexicon in Persian Language
Due to the increasing growth of digital content on the internet and social media, sentiment analysis problem is one of the emerging fields. This problem deals with information extraction and knowledge discovery from textual data using natural language processing has attracted the attention of many researchers. Construction of sentiment lexicon as a valuable language resource is a one of the imp...
متن کاملAutomatically constructing Wordnet Synsets
Manually constructing a Wordnet is a difficult task, needing years of experts’ time. As a first step to automatically construct full Wordnets, we propose approaches to generate Wordnet synsets for languages both resource-rich and resource-poor, using publicly available Wordnets, a machine translator and/or a single bilingual dictionary. Our algorithms translate synsets of existing Wordnets to a...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2012